PERMUTOOLS: A MATLAB PACKAGE FOR MULTIVARIATE PERMUTATION TESTING

Statistical hypothesis testing and effect size measurement are routine parts of quantitative research. Advancements in computer processing power have greatly improved the capability of statistical inference through the availability of resampling methods. However, many of the statistical practices used today are based on traditional, parametric methods that rely on assumptions about the underlying population. These assumptions may not always be valid, leading to inaccurate results and misleading interpretations. Permutation testing, on the other hand, generates the sampling distribution empirically by permuting the observed data, providing distribution-free hypothesis testing. Furthermore, this approach lends itself to a powerful method for multiple comparison correction — known as max correction — which is less prone to type II errors than conventional correction methods. Parametric methods have also traditionally been utilized for estimating the confidence interval of various test statistics and effect size measures. However, these too can be estimated empirically using permutation or bootstrapping techniques. Whilst resampling methods are generally considered preferable, many popular programming languages and statistical software packages lack efficient implementations. Here, we introduce PERMUTOOLS, a MATLAB package for multivariate permutation testing and effect size measurement.


Background Hypothesis testing
For over a century, researchers have relied on parametric statistical procedures for conducting hypothesis testing, such as the famous Student's t-test [Student, 1908].However, parametric testing was developed out of the necessity to make inferences about the null distribution, as it was impractical to generate it empirically.Today, it is possible to do so using permutation tests.Permutation tests work by permuting the observed data in an appropriate manner to compute the empirical distribution of the test statistic of interest -known as the permutation distribution -which approaches the null distribution [Fisher, 1935].From the permutation distribution, we can estimate the confidence interval (CI) of the test statistic by computing the corresponding percentiles (e.g.2.5% and 97.5% percentiles for 95% CI).We can also estimate the probability of observing such a result by chance (i.e. the p-value) by calculating the proportion of the permutation distribution that is greater than or equal to the magnitude of the test statistic (Fig. 1).The more permutations generated, the more accurate the estimate.Typically, permutations in the order of several thousand are required to obtain reliable p-values, which is computationally trivial for modern computers [Ernst, 2004].
Permutation testing can be applied to any statistical test.As no assumptions are made about the shape of the underlying distribution, permutation tests provide distribution-free, nonparametric hypothesis testing without the need for rank-transformations [Holt and Sullivan, 2023].For independent samples, permutation tests have been shown to be relatively insensitive to differences in population variance when samples of equal size are used [Murphy, 1967, Groppe et al., 2011b].Moreover, permutation tests have been shown to be more accurate than parametric tests in fields such as biomedical research [Ludbrook and Dudley, 1998], and yield more reliable p-values and thus are more likely to produce replicable results [Noguchi et al., 2021].Despite the widespread use of significance thresholding, it is recommended to treat p-values as a continuous measure [McShane et al., 2019].However, their continuous value does not indicate the strength of an effect, only its likelihood.PERMUTOOLS offers permutation testing and confidence interval estimation for a range of statistical tests, including the ANOVA (one-way, two-way), t-test (one-sample, pairedsample, two-sample), F-test (two-sample), Z-test (one-sample), and correlation test (Pearson, Spearman, rankit).It does not output dichotomous test results based on an arbitrary significance threshold to discourage this practice.

Correcting for multiple comparisons
When conducting multiple hypothesis tests simultaneously, there is an increased risk of false discoveries or type I errors.However, many of the traditional correction methods used to control family-wise error rate (FWER) tend to be overly conservative, resulting in increased type II errors (e.g.Bonferroni correction, Holm-Bonferroni method).Researchers are often faced with the delicate task of controlling the trade-off between type I and type II errors.In the biomedical field, researchers tend to preference controlling for type I errors, because the consequences of false positives can be detrimental, e.g. the introduction of an ineffective new therapy or treatment [Ludbrook and Dudley, 1998].This inherent bias, along with the use of overly conservative correction methods, can be limiting in the pursuit of scientific progress.
Another advantage of permutation tests is that they can utilize a powerful technique for correcting for multiple comparisons -known as max correction -which is less prone to type II errors than conventional correction methods [Blair andKarniski, 1993, Westfall andYoung, 1993].Max correction (also known as t max correction [Blair et al., 1994] or joint correction [Boca et al., 2014]) works as follows: on each permutation of the data, a separate test statistic is computed for each variable and the maximum absolute value (or most extreme positive or negative value) is taken across all variables.Repeating this procedure thousands of times produces a single permutation distribution against which the actual test statistic is compared (Fig. 1).Thus, taking the maximum across more variables naturally produces a more conservative permutation distribution.This highly intuitive approach provides strong control of FWER, even for small sample sizes [Gondan, 2010, Groppe et al., 2011a, Rousselet, 2023].Max correction has been used in various scientific disciplines, including the study of electrophysiological data [Blair andKarniski, 1993, Groppe et al., 2011a,b] and human behavioural data [Gondan, 2010, Shaw et al., 2020, Crosse et al., 2022].PERMUTOOLS automatically applies max correction to multivariate data, unless specified otherwise.

Measuring effect size
Effect size measurement is an equally important part of inferential statistics [Hentschke and Stüttgen, 2011].Today, most scientific journals require that authors report the size of an effect, and not just its dichotomous existence.A common effect size measure used in research is the standardised mean difference, known as Cohen's d [Cohen, 1969].Standardised effect sizes have the advantage of being metric-free, meaning that they can be directly compared across different studies.For independent samples, Cohen's d uses the pooled standard deviation when they are assumed to have equal variances, and an unpooled estimate when this cannot be assumed.For independent samples with significantly different variances, an estimate based on the control sample's variance can be used, known as Glass' delta [Glass, 1976].In the case of ordinal data, an alternative formulation known as Cliff's d should be used [Cliff, 1993].PERMUTOOLS gives the option to implement any of the above standardised effect size measures, as well as several unstandardised measures.
It is also typical to report the confidence interval of an effect size.Whilst they have traditionally been calculated using parametric methods, these too can be estimated empirically by generating the sampling distribution using a resampling procedure known as bootstrapping.Bootstrapping is fundamentally different to permutation testing in that it resamples with replacement, whereas permutation testing resamples without replacement.PERMUTOOLS uses an efficient bootstrapping algorithm that is optimised for multivariate data analysis.

Correcting for sample size
As with all statistical tests, the larger the sample size, the more accurately it can describe the population.However, in certain fields such as cognitive science, researchers are often required to work with relatively small (n < 50) sample sizes due to the limited availability of test subjects or other logistical study constraints.This can hinder the researcher's ability to measure the true, unbiased size of an effect.Despite the advantages of standardised effect size measures, metrics such as Cohen's d have been shown to have an upwards bias of up to about 4% for sample sizes of less than 50.This bias is somewhat reduced by using the pooled weighted standard deviation of the samples [Hedges, 1981].Additionally, a bias correction factor can be applied to the effect size measure and CI, which is approximately equal to 1 − 3/(4n − 9) [Hedges and Olkin, 1985].When this correction factor is applied, it is usual to refer to the effect size as Hedges' g.PERMUTOOLS automatically applies bias correction when calculating Cohen's d and Glass' delta (and their CIs), unless specified otherwise.

Statement of need
Whilst resampling methods have gained widespread acceptance, many popular programming languages, including MATLAB, lack efficient implementations or have not yet fully integrated them into their core statistical packages.This hinders the adoption of these robust and versatile techniques by researchers, therefore limiting the quality and reliability of quantitative research.To address this need, PERMUTOOLS provides a comprehensive set of functions in the MATLAB programming language for conducting resampling-based inferential statistics that are easy to use and computationally efficient.Moreover, it is optimised for dealing with large, multivariate datasets and offers powerful methods for correcting for multiple comparisons and sample size, making it an invaluable tool for researchers across various fields of quantitative research.PERMUTOOLS offers a range of new features that distinguish it from existing statistical software packages.Some of the key features are described below.

Key features
• Optimised Resampling Algorithms: PERMUTOOLS utilizes efficient implementations of resampling algorithms that are optimised for multivariate data, ensuring efficient processing of even large datasets with negligible compute times.
• Multiple Comparison Correction: PERMUTOOLS implements the powerful max correction method to adjust p-values and CIs for multiple comparisons when dealing with multivariate data, reducing the risk of both type I and type II errors.
• Sample Size Correction: PERMUTOOLS applies a bias correction factor to the relevant standardised effect size measures and their CIs to adjust for any inherent inflation due to sample size, ensuring less biased estimates for small (n < 50) samples.
• Multivariate Processing: PERMUTOOLS handles multivariate datasets with ease, allowing for multiple tests to be performed simultaneously, as well as the option to perform pairwise comparisons between every combination of variables in a matrix (e.g.correlation matrix).
• Continuous Framework: PERMUTOOLS does not output test results under the traditional dichotomous decision-based framework (i.e.H = 0, 1).Instead, it outputs various quantitative measures, encouraging researchers to interpret their results in a continuous and holistic manner.

Using PERMUTOOLS
The PERMUTOOLS functions are designed to mimic the API of the equivalent parametric functions in MATLAB, with the addition of the prefix 'permu' at the beginning of the function name.For example, to conduct a permutation test based on the t-statistic, the usual function ttest() becomes permuttest() etc.The input and output arguments are the same as before with the exception that the first output variable is always the test statistic (and not a dichotomous test result).An additional output variable containing the sampling distribution is included, as well as additional resampling-related input arguments that are described in the help documentation of each function.Unlike many existing statistical software packages, PERMUTOOLS uses a consistent input/output argument framework across all its functions to optimise usability.

Example
The following example illustrates typical usage of the PERMUTOOLS toolbox.Here, we compare the means of two independent multivariate samples using permutation tests based on the t-statistic with max correction, and contrast the results with the equivalent parametric tests in MATLAB (i.e.two-sample t-tests).We then measure the associated effect sizes based on the bias-corrected standardised mean difference (i.e.Hedges' g), as well as the corresponding 95% CIs using both a bootstrapping and parametric approach.
First, we generate random multivariate data for two independent samples X and Y.Both samples have 20 variables, each with a mean value of approximately 0, except for the first 10 variables of Y which have a mean value of approximately −1.Each variable contains 30 observations.
% Generate random data rng (42) ; x = randn (30 ,20) ; y = randn (30 ,20) ; y (: ,1:10) = y (: ,1:10) -1; Because we generated the random multivariate data from the same (normal) distribution, we can assume that their variances are approximately equal.If we could not assume this, we would first conduct a two-tailed test of variance based on the F-statistic using PERMUTOOLS' permuvartest2() function.Thus, we can proceed using the standard Student's t-statistic (as opposed to Welch's t-statistic).The two-sample permutation tests are implemented using PERMUTOOLS' permuttest2() function and the equivalent parametric tests are implemented using MATLAB's ttest2() function.By default, max correction is applied to the permutation tests.
% Run MATLAB ' s parametric effect size analysis d3 = zeros (1 ,20) ; ci3 = zeros (2 ,20) ; for j = 1:20 stats3 = meanEffectSize ( x (: , j ) ,y (: , j ) , ' effect ' , ' cohen ' , ' paired ' ,0) ; d3 ( j ) = stats3 .Effect ; ci3 (: , j ) = stats3 .ConfidenceIntervals '; end % Run PERMUTOOLS ' bootstrapped effect size analysis [ d4 , ci4 , stats4 ] = booteffectsize (x ,y , ' effect ' , ' cohen ' , ' paired ' ,0) ; To compare the results of our parametric and resampling analyses, we next plot some of the statistics as a function of variable (Fig. 2).The effect of max correction is clearly visible on the test statistic CIs and the resulting p-values, which are consistently more conservative than those of the uncorrected parametric tests (Fig. 2, left, middle).Importantly, spurious effects observed in the parametric case (e.g.variable 20) did not survive max correction in the permutation tests.In the effect size analysis, the 95% CIs estimated via bootstrapping appear to approximate the parametric CIs reasonably well (Fig. 2 From the above statistical analysis, we can examine the results of each individual pairwise comparison between X and Y, which are contained in the variables output by the PERMUTOOLS functions.Taking the first variables of X and Y as an example, we see that the mean of X 1 (M = −0.06,SD = 0.91) was significantly greater than the mean of Y 1 (M = −1.09,SD = 0.86), even after adjusting for multiple comparisons (t(58) = 4.49, p = 0.0008, Hedges' g = 1.14, 95CI [0.68, 1.72]).We recommend always reporting the absolute values of the tests as shown here, in particular, with the effect size and CI included.

Future work
PERMUTOOLS is under active development, and new functions and features are constantly being added to it as needed.Future work aims to expand the toolbox to provide new permutation-based tests, including repeated measures ANOVAs (one-way, two-way), and multi-way ANOVAs (n-way), as well as expanding the functionality of existing ANOVA tests to be able to deal with unbalanced and multivariate samples.Whilst permutation tests have been shown to outperform nonparametric tests based on rank transformations [Holt and Sullivan, 2023], there are certain situations where a rank-transformation approach is desirable; for example, when dealing with ordinal data and outliers.To accommodate a wider range of data types, future work aims to develop permutation-based solutions for nonparametric tests based on rank transformations, e.g.Sign test, Wilcoxon signed rank test, Mann-Whitney U-test, Kruskal-Wallis test, Friedman test, etc. PERMUTOOLS is maintained by the corresponding author but accepts contributions from the research community at large.If you would like to contribute to PERMUTOOLS, or request that a specific statistical test or feature be added to it, please email the corresponding author at the email address provided above, or use the formal channels on GitHub, such as creating a pull request or opening an issue.

Figure 1 :
Figure1: Permutation distributions for two-tailed tests based on the t-statistic (Blue: uncorrected, Red: maxcorrected).Multivariate data were randomly generated to simulate two independent samples, both consisting of 20 variables (i.e.adjusted for 20 comparisons), each with 30 observations (see example code below).The uncorrected permutation distribution shown is that generated for the third test (i.e.variable 3).

Figure 2 :
Figure2: Results of the permutation tests and effect size analysis.Left: Test statistic (unstandardised) and 95% CI (parametric and permutation) for each test.Middle: P-value (parametric and permutation) for each test.Right: Effect size (Hedges' g) and 95% CI (parametric and bootstrapped) for each test.